Uncontrolled Interthread Interference in Main Memory Can Destroy Individual Threads’ Memory-Level Parallelism, Effectively Serializing the Memory Requests of a Thread Whose Latencies Would Otherwise Have Largely
Authors
Abstract
... The main memory (dynamic RAM) system is a major limiter of computer system performance. In modern processors, which are overwhelmingly multicore (or multithreaded), the concurrently executing threads share the DRAM system, and different threads running on different cores can delay each other through resource contention. One thread’s memory requests can cause DRAM bank conflicts, row-buffer conflicts, and data/address bus conflicts with another’s. As the number of on-chip cores increases, the pressure on the DRAM system increases, as does the interference among threads sharing the system. Unfortunately, many conventional DRAM controllers are unaware of this interthread interference. They schedule requests simply to maximize DRAM data throughput. For example, the commonly used row-hit-first scheduling policy (FR-FCFS, or first ready, first come, first served) is thread unaware. Uncontrolled interthread interference in DRAM scheduling results in two major problems. First, as previous work showed, a state-of-the-art DRAM controller can unfairly prioritize some threads while starving more important threads for long time periods as they wait to access memory (see the “Related Work on Memory Controllers” sidebar). For example, FR-FCFS unfairly prioritizes threads with high row-buffer hit rates over those with low row-buffer hit rates. Similarly, an oldest-first scheduling policy implicitly prioritizes memory-intensive threads over memory-nonintensive ones. In fact, it is possible to write programs to deny DRAM service to more important programs running on the same chip, as we showed in our previous work. Such ...
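The row-hit-first behavior described in the abstract can be illustrated with a small sketch. This is a toy model, not the scheduler from the paper: it assumes a single DRAM bank, ignores all timing, and reduces "first ready" to "targets the currently open row". The names `Request`, `fr_fcfs_pick`, and `schedule` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    arrival: int  # arrival order; lower means older
    thread: str   # issuing thread (never consulted by the scheduler)
    row: int      # DRAM row the request targets


def fr_fcfs_pick(queue: List[Request], open_row: Optional[int]) -> Request:
    """Pick the next request: row-buffer hits first, then oldest first.

    Assumption: one bank, no timing constraints. The thread field is
    ignored, which is what makes the policy thread unaware.
    """
    # False sorts before True, so requests that hit the open row win;
    # ties are broken by arrival order.
    return min(queue, key=lambda r: (r.row != open_row, r.arrival))


def schedule(queue: List[Request], open_row: Optional[int] = None) -> List[Request]:
    """Drain the queue and return the requests in service order."""
    pending = list(queue)
    order = []
    while pending:
        req = fr_fcfs_pick(pending, open_row)
        pending.remove(req)
        open_row = req.row  # the serviced row stays open in the row buffer
        order.append(req)
    return order


# Thread A streams through one row (high row-buffer hit rate);
# thread B touches a different row on every access (low hit rate).
requests = [
    Request(0, "A", 7),
    Request(1, "B", 3),
    Request(2, "A", 7),
    Request(3, "B", 9),
    Request(4, "A", 7),
]
service_order = [r.thread for r in schedule(requests)]
print(service_order)
```

In this toy run, all of thread A's requests are serviced before thread B's oldest request, even though that request arrived second. This is the kind of unfairness toward low-hit-rate threads that the abstract attributes to FR-FCFS.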
Similar Resources
Enhancing the Performance and Fairness of Shared DRAM Systems with Parallelism-Aware Batch Scheduling
Onur Mutlu, Thomas Moscibroda (Microsoft Research). Abstract: In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destro...
Parallelism-Aware Batch Scheduling: Paving the Way to High-Performance and Fair Memory Controllers
In modern processors, the DRAM system is shared among concurrently executing threads. Memory requests from a thread can delay requests from other threads by causing bank/bus/row-buffer conflicts. Conventional DRAM controllers are unaware of inter-thread interference, which causes two problems. First, some threads are unfairly penalized and denied DRAM service for long time periods. Second, as we...
Memory Compression Coordinated and Optimized Prefetching in GPU Architectures
Traditionally, GPU architectures have been primarily focused on throughput and latency hiding. However, as the computational power of GPUs continues to scale with Moore’s law, an increasing number of applications are becoming limited by memory bandwidth [1]. Also, data locality and reuse are becoming increasingly important with power-limited technology scaling. The energy spent on off-chip memo...
Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor
Simultaneous Multi-threading (SMT) has been developed to increase instruction level parallelism by allowing instructions from a different thread to run during a stall. Inter-thread cache interference, however, might limit the benefit of running multiple independent threads. SMT processors can be utilized in a different model, where a helper thread is used to prefetch cache blocks for the main e...
Multigranular Thread Support in WaveScalar
WaveScalar is a recently proposed scalable microarchitecture. The original WaveScalar research developed and evaluated an ISA and microarchitecture that efficiently executes a single, coarse-grain thread. In this paper, we expand that design to support multiple, simultaneously executing threads. Four mechanisms make this possible: (1) instructions that enable and disable wave-ordered memory; (2...